Abstract

We present fast strongly universal string hashing families: they can process data at
a rate of 0.2 CPU cycle per byte. Perhaps surprisingly, we find that these families,
though they require a large buffer of random numbers, are often faster than popular
hash functions with weaker theoretical guarantees. Moreover, conventional wisdom is
that hash functions with fewer multiplications are faster. Yet we find that they may fail
to be faster due to operation pipelining. We present experimental results on several
processors including low-powered processors. Our tests include hash functions designed
for processors with the Carry-Less Multiplication (CLMUL) instruction set. We also
prove, using accessible proofs, the strong universality of our families.

Keywords: String Hashing, Superscalar Computing, Barrett Reduction, Carry-less
Multiplications, Binary Finite Fields



1. Introduction 

For 32-bit numbers, random hashing with good theoretical guarantees can be just
as fast as popular alternatives [1]. In turn, these guarantees ensure the reliability of
various algorithms and data structures: frequent-item mining [2], count estimation [3, 4],
and hash tables [5, 6]. We want to show that we can also get good theoretical
guarantees over larger objects (such as strings) without sacrificing speed. For example,
we consider variable-length strings made of 32-bit characters: all data structures can
be represented as such strings, up to some padding.

We restrict our attention to hash functions mapping strings to L-bit integers, that
is, integers in $[0, 2^L)$ for some positive integer L. In random hashing, we select a hash
function at random from a family [7, 8]. The hash function can be chosen whenever
the software is initialized. While random hashing is not yet commonplace, it can have
significant security benefits [9] in a hash table: without randomness, an attacker can
more easily exploit the fact that adding n keys hashing to the same value typically
takes quadratic time ($\Theta(n^2)$). For this reason, random hashing was adopted in the
Ruby language as of version 1.9 [10] and in the Perl language as of version 5.8.1.

A family of hash functions is k-wise independent if the hash values of any k distinct
elements are independent. For example, a family is pairwise independent (or strongly
universal) if given any two distinct elements s and s', their hash values h(s) and h(s')
are independent:

$$P(h(s) = y \mid h(s') = y') = P(h(s) = y)$$

for any two hash values y, y'. When a hashing family is not strongly universal, it can
still be universal if the probability of a collision is no larger than if it were strongly
universal: $P(h(s) = h(s')) \le 1/2^L$ when $2^L$ is the number of hash values. If the
collision probability is merely bounded by some $\epsilon$ larger than $1/2^L$ but smaller than
1 ($P(h(s) = h(s')) \le \epsilon < 1$), we have an almost universal family. However, strong
universality might be more desirable than universality or almost universality:

• We say that a family is uniform if all hash values are equiprobable
($P(h(s) = y) = 1/2^L$ for all y and s): strongly universal families are uniform, but universal
or almost universal families may fail to be uniform. To see that universality fails
to imply uniformity, consider the family made of the two functions over 1-bit
integers (0, 1): the identity and a function mapping all values to zero. The
probability of a collision between two distinct values is exactly 1/2, which ensures
universality even though we do not have uniformity since $P(h(0) = 0) = 1$.

• Moreover, if we have strong universality over L bits, then we also have it over
any subset of bits. The corresponding result may fail for universal and almost
universal families: we might have universality over L bits, but fail to have almost
universality over some subset of bits. Consider the non-uniform but universal
family {h(x) = x} over L-bit integers: if we keep only the least significant
L' bits (0 < L' < L), universality is lost since
$h(0) \bmod 2^{L'} = h(2^{L'}) \bmod 2^{L'}$.
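The two counterexamples above are small enough to check by enumeration. The following is a minimal sketch; the helper name `collision_probability` and the choice L = 4, L' = 2 are assumptions made only for illustration.

```python
from itertools import combinations

def collision_probability(family, x, y):
    """Fraction of the functions in `family` that map x and y to the same value."""
    return sum(h(x) == h(y) for h in family) / len(family)

# First example: the identity and the constant-zero function over 1-bit integers.
family = [lambda x: x, lambda x: 0]

# Universal over 1 bit: the two distinct inputs collide with probability 1/2.
assert collision_probability(family, 0, 1) == 0.5

# Not uniform: h(0) = 0 for every function in the family.
p_h0_is_0 = sum(h(0) == 0 for h in family) / len(family)
assert p_h0_is_0 == 1.0

# Second example: the single-function family {h(x) = x} over L-bit integers.
L, L_prime = 4, 2  # illustrative sizes (assumption)
identity = [lambda x: x]
truncated = [lambda x: x % (1 << L_prime)]  # keep the L' least significant bits

# The identity never collides, so it is universal ...
assert all(collision_probability(identity, x, y) == 0
           for x, y in combinations(range(1 << L), 2))
# ... but after truncation, 0 and 2^L' always collide.
assert collision_probability(truncated, 0, 1 << L_prime) == 1.0
```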

There is no need to use slow operations such as modulo operations, divisions or
operations in finite fields to have strong universality. In fact, for short strings having
few distinct characters, Zobrist hashing requires nothing more than table look-ups and
bitwise exclusive-or operations, and it is more than strongly universal (3-wise
independent) [11, 12]. Unfortunately, it becomes prohibitive for long strings as it requires the
storage of nc random numbers where n is the maximal length of a string and c is the
number of distinct characters.

A more practical approach to strong universality is Multilinear hashing (§ 2).
Unfortunately, it normally requires that the computations be executed in a finite field.
Some processors have instructions for finite fields (§ 4) or they can be emulated with
a software library (§ 5.3). However, if we are willing to double the number of random
bits, we can implement it using regular integer arithmetic. Indeed, using an idea from
Dietzfelbinger [13], we implement it using only one multiplication and one addition per
character (§ 3). We further attempt to speed it up by reducing the number of
multiplications by half. We believe that these families are the fastest strongly universal hashing
families on current computers. We evaluate these hash families experimentally (§ 5):

• Using fewer multiplications has often improved performance, especially on
low-power processors [14]. Yet reducing the number of multiplications fails to
improve (and may even degrade) performance on several processors according to
our experiments, which include low-power processors. However, reducing the
number of multiplications is beneficial on other processors (e.g., AMD), even
server-class processors.

• We also find that strongly universal hashing may be computationally inexpensive
compared to common hashing functions, as long as we ignore the overhead of
generating long strings of random numbers. In effect, if memory is abundant
compared to the length of the strings, the strongly universal Multilinear family
is faster than many of the commonly used alternatives.

• We consider hash functions designed for hardware-supported carry-less
multiplications (§ 4). This support should drastically improve the speed of some
operations over binary finite fields (GF($2^L$)). Unfortunately, we find that the
carry-less hash functions fail to be competitive (§ 5.4).



2. The Multilinear family 

The Multilinear hash family is one of the simplest strongly universal families [7]. It
takes the form of a scalar product between random values (sometimes called keys) and
string components, where operations are over a finite field:

$$h(s) = m_1 + \sum_{i=1}^{n} m_{i+1} s_i.$$

The hash function h is defined by the randomly generated values $m_1, m_2, \ldots$ It is
strongly universal over fixed-length strings. We can also apply it to variable-length
strings as long as we forbid strings ending with zero. To ensure that strings never end
with zero, we can append a character value of one to all variable-length strings.
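As an illustration of the scalar-product form and of the append-one rule, here is a minimal sketch. It assumes, for simplicity, the prime field of cardinality $2^{61} - 1$ rather than the binary field of cardinality $2^{32}$ discussed next; the function names and the toy key values are hypothetical.

```python
P = (1 << 61) - 1  # the Mersenne prime 2^61 - 1; field choice is an assumption of this sketch

def multilinear_prime_field(keys, s):
    """h(s) = m_1 + sum_i m_{i+1} s_i over GF(P); keys[0] plays the role of m_1."""
    if len(keys) < len(s) + 1:
        raise ValueError("need at least len(s) + 1 random values")
    acc = keys[0]
    for m, c in zip(keys[1:], s):
        acc = (acc + m * c) % P
    return acc

def hash_variable_length(keys, s):
    """Append the character value one so that no hashed string ends with zero."""
    return multilinear_prime_field(keys, list(s) + [1])

# Toy keys for illustration only; in practice they are drawn uniformly from [0, P).
keys = [5, 7, 11, 13, 17]

print(multilinear_prime_field(keys, [2, 3]))  # 5 + 7*2 + 11*3 = 52
print(hash_variable_length(keys, [2, 3]))     # 52 + 13*1 = 65

# Without the appended one, a trailing zero never changes the hash value.
assert multilinear_prime_field(keys, [2, 3]) == multilinear_prime_field(keys, [2, 3, 0])
# With it, the two strings are hashed as [2, 3, 1] and [2, 3, 0, 1].
assert hash_variable_length(keys, [2, 3]) != hash_variable_length(keys, [2, 3, 0])
```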

An apparent limitation of this approach is that strings cannot exceed the number of
random values. In effect, to hash strings of n 32-bit characters, we need to generate and
store 32(n+1) random bits using a finite field of cardinality $2^{32}$. However, Stinson [15]
showed that strong universality requires at least 1 + a(b − 1) hash functions where a is
the number of strings and b is the number of hash values. Thus, if we have 32-bit strings
mapped to 32-bit hash values, we need at least $\approx 2^{32(n+1)}$ hash functions: Multilinear
is almost optimal.

Hence, the requirement to store many random numbers cannot be waived without
sacrificing strong universality. Note that Stinson's bound is not affected by
manipulations such as treating a length n string of W-bit words as a length n/2 string of 2W-bit
words.

If multiplications are expensive and we have long strings, we can attempt to
improve speed by reducing the number of multiplications by half [16, 17]:

$$h(s) = m_1 + \sum_{i=1}^{n/2} (m_{2i} + s_{2i-1})(m_{2i+1} + s_{2i}). \qquad (1)$$

While this new form assumes that the number of characters in the string is even, we
can simply pad the odd-length strings with an extra character with value zero. With
variable-length strings, the padding to even length must follow the addition of a
character value of one.

Could we reduce the number of multiplications further? Not in general: the
computation of a scalar product between two vectors of length n requires at least
$\lceil n/2 \rceil$ multiplications [18, Corollary 4]. However, we could try to avoid generic
multiplications altogether and replace them by squares [14]:

$$h(s) = m_1 + \sum_{i=1}^{n} (m_{i+1} + s_i)^2.$$

Indeed, squares can sometimes be computed faster. Unfortunately, this approach
fails in binary finite fields (GF($2^L$)) because

$$(m_{i+1} + s_i)^2 = m_{i+1}^2 + m_{i+1} s_i + m_{i+1} s_i + s_i^2 = m_{i+1}^2 + s_i^2$$

since every element is its own additive inverse. Thus, we get

$$h(s) = m_1 + \sum_{i=1}^{n} m_{i+1}^2 + \sum_{i=1}^{n} s_i^2$$

which is a poor hash function (e.g., h(ab) = h(ba)).
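The collision h(ab) = h(ba) can be observed directly. The sketch below is an illustration under an assumption: it uses the small field GF($2^8$) with the irreducible polynomial $x^8 + x^4 + x^3 + x + 1$, and the function names are hypothetical.

```python
import random

def gf256_mul(a, b, poly=0x11B):
    """Multiply two elements of GF(2^8) = GF(2)[x]/(x^8 + x^4 + x^3 + x + 1)."""
    r = 0
    while b:
        if b & 1:
            r ^= a          # addition in a binary field is exclusive-or
        a <<= 1
        if a & 0x100:       # reduce as soon as the degree reaches 8
            a ^= poly
        b >>= 1
    return r

def squares_hash(keys, s):
    """h(s) = m_1 + sum_i (m_{i+1} + s_i)^2 computed in GF(2^8)."""
    acc = keys[0]
    for m, c in zip(keys[1:], s):
        t = m ^ c
        acc ^= gf256_mul(t, t)
    return acc

rng = random.Random(0)
keys = [rng.randrange(256) for _ in range(3)]
a, b = 0x61, 0x62

# Swapping the two characters never changes the hash value, whatever the keys.
assert squares_hash(keys, [a, b]) == squares_hash(keys, [b, a])
```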

There are fast algorithms to compute multiplications [19-21] in binary finite fields.
Yet these operations remain much slower than a native operation (e.g., a regular
32-bit integer multiplication). However, some recent processors have support for finite
fields. In such cases, the penalty could be small for using finite fields, as opposed to
regular integer arithmetic (see § 4 and § 5.4). (Though they are outside our scope, there
are also fast techniques for computing hash functions over a finite field having prime
cardinality [22].)



3. Making Multilinear strongly universal in the ring $\mathbb{Z}/2^K\mathbb{Z}$

On processors without support for binary finite fields, we can trade memory for
speed to essentially get the same properties as finite fields on some of the bits using
fast integer arithmetic. For example, Dietzfelbinger [13] showed that the family of
hash functions of the form

$$h_{A,B}(x) = (Ax + B \bmod 2^K) \div 2^{L-1}$$

where the integers $A, B \in [0, 2^K)$ and $x \in [0, 2^L)$ is strongly universal for $K \ge L - 1$.
(For fewer parentheses, we adopt the convention that $Ax + B \bmod 2^K = (Ax + B) \bmod 2^K$.
The symbol $\div$ denotes integer division: $x \div y = \lfloor x/y \rfloor$ for positive
integers.) We generalize Dietzfelbinger hashing from the linear to the multilinear case.

The main difference between a finite field and common integer arithmetic (in the
integer ring $\mathbb{Z}/2^K\mathbb{Z}$) is that elements of fields have inverses: given the equation ax = b,
there is a unique solution $x = a^{-1}b$ when $a \ne 0$. However, the same is "almost" true
in integer rings used for computer arithmetic as long as the variable a is small. For
example, when a = 1, we can solve for ax = b exactly (x = b). When a = 2, then
there are at most two solutions to the equation ax = b. We build on these observations
to derive a stronger result.

We let $\tau = \mathrm{trailing}(a)$ be the number of trailing zeros of the integer a in binary
notation. For example, we have that $\mathrm{trailing}(2^j) = j$.

Proposition 1. Given integers K, L satisfying $K \ge L - 1 \ge 0$, consider the equation

$$(ax + c \bmod 2^K) \div 2^{L-1} = b$$

where a is an integer in $[1, 2^L)$, b is an integer in $[0, 2^{K-L+1})$ and c an integer in
$[0, 2^K)$. Given a, b and c, there are exactly $2^{L-1}$ integers x in $[0, 2^K)$ satisfying the
equation.

Proof. Let $\tau = \mathrm{trailing}(a)$. We have $\tau \le L - 1$ since $a \in [1, 2^L)$. Because a is
non-zero, we have that $a' = a \div 2^\tau$ is odd and $a = 2^\tau a'$.
We have

$$((ax + c) \bmod 2^K) \div 2^{L-1} = ((2^\tau a' x + c) \bmod 2^K) \div 2^{L-1}$$
$$= (2^\tau [a'x + (c \div 2^\tau)] + (c \bmod 2^\tau) \bmod 2^K) \div 2^{L-1}.$$

We show that the term $(c \bmod 2^\tau)$ can be removed. Indeed, the $\tau$ least significant
bits of $2^\tau[a'x + (c \div 2^\tau)] + (c \bmod 2^\tau)$ are those of $c \bmod 2^\tau$ whereas the more
significant bits are those of $2^\tau[a'x + (c \div 2^\tau)]$. The final division by $2^{L-1}$ will dismiss
the L − 1 least significant bits, and $\tau \le L - 1$, so that the term $(c \bmod 2^\tau)$ can be
ignored.

Hence, we have

$$(ax + c \bmod 2^K) \div 2^{L-1} = (2^\tau [a'x + (c \div 2^\tau)] \bmod 2^K) \div 2^{L-1}$$
$$= (2^\tau [a'x + (c \div 2^\tau) \bmod 2^{K-\tau}]) \div 2^{L-1}$$
$$= (a'x + (c \div 2^\tau) \bmod 2^{K-\tau}) \div 2^{L-\tau-1}$$
$$= (a'(x \bmod 2^{K-\tau}) + (c \div 2^\tau) \bmod 2^{K-\tau}) \div 2^{L-\tau-1}.$$

Setting $x' = x \bmod 2^{K-\tau}$ and $c' = c \div 2^\tau$, we finally have

$$(ax + c \bmod 2^K) \div 2^{L-1} = (a'x' + c' \bmod 2^{K-\tau}) \div 2^{L-\tau-1}.$$

Let z be an integer such that $z \div 2^{L-\tau-1} = b$. Consider $a'x' + c' \bmod 2^{K-\tau} = z$.
We can rewrite it as $a'x' \bmod 2^{K-\tau} = z - c' \bmod 2^{K-\tau}$. Because a' is odd, a' and
$2^{K-\tau}$ are coprime (their greatest common divisor is 1). Hence, there is a unique integer
$x' \in [0, 2^{K-\tau})$ such that $a'x' \bmod 2^{K-\tau} = z - c' \bmod 2^{K-\tau}$ [23, Cor. 31.25].

Given b, there are $2^{L-\tau-1}$ integers z such that $z \div 2^{L-\tau-1} = b$. Given x', there are
$2^\tau$ integers x in $[0, 2^K)$ such that $x' = x \bmod 2^{K-\tau}$. It follows that there are
$2^{L-\tau-1} \times 2^\tau = 2^{L-1}$ integers x in $[0, 2^K)$ such that
$(a'x' + c' \bmod 2^{K-\tau}) \div 2^{L-\tau-1} = b$ holds. □



Example 1. Consider the equation $(6x + 10 \bmod 64) \div 4 = 5$. By Proposition 1, there
must be exactly 4 solutions to this equation (setting K = 6, L = 3). We can find them
using the proof of the proposition. The integer 6 has 1 trailing zero in binary notation (110)
so that $\tau = 1$. We can write 6 = 2 × 3 so that a' = 3. Similarly, $c' = 10 \div 2 = 5$.
Hence we must consider the equation $3x' + 5 \bmod 2^5 = z$ for values of z such that
$z \div 2 = 5$. There are two such values: z = 10 and z = 11. We have that

$3x' + 5 \bmod 32 = 10 \Rightarrow 3x' \bmod 32 = 5 \Rightarrow x' = 23$ and
$3x' + 5 \bmod 32 = 11 \Rightarrow 3x' \bmod 32 = 6 \Rightarrow x' = 2$.

It remains to solve for x in $x' = x \bmod 32$ with the constraint that x is an integer
in [0, 64). When x' = 2, we have that $x \in \{2, 34\}$. When x' = 23, we have that
$x \in \{23, 55\}$. Hence, the solutions are 2, 23, 34 and 55.
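Both Example 1 and the count stated in Proposition 1 can be confirmed by exhaustive search for small parameters. This is a sketch for checking purposes only; the helper name `solutions` is hypothetical.

```python
def solutions(a, b, c, K, L):
    """All x in [0, 2^K) with ((a*x + c) mod 2^K) div 2^(L-1) == b."""
    return [x for x in range(1 << K)
            if ((a * x + c) % (1 << K)) >> (L - 1) == b]

# Example 1: (6x + 10 mod 64) div 4 = 5 with K = 6, L = 3.
print(solutions(6, 5, 10, K=6, L=3))  # [2, 23, 34, 55]

# Proposition 1 for every admissible a, b, c when K = 6 and L = 3:
# there are always exactly 2^(L-1) = 4 solutions.
K, L = 6, 3
for a in range(1, 1 << L):
    for b in range(1 << (K - L + 1)):
        for c in range(1 << K):
            assert len(solutions(a, b, c, K, L)) == 1 << (L - 1)
```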

Using Proposition 1, we can show that fast variations of Multilinear are strongly
universal even though we use regular integer arithmetic, not finite fields.

Theorem 1. Given integers K, L satisfying $K \ge L - 1 \ge 0$, consider the families of
(K − L + 1)-bit hash functions

• MULTILINEAR:

$$h(s) = \left(\left(m_1 + \sum_{i=1}^{n} m_{i+1} s_i\right) \bmod 2^K\right) \div 2^{L-1}$$

• MULTILINEAR-HM:

$$h(s) = \left(\left(m_1 + \sum_{i=1}^{n/2} (m_{2i} + s_{2i-1})(m_{2i+1} + s_{2i})\right) \bmod 2^K\right) \div 2^{L-1}$$

which assumes that n is even.

Here the $m_i$'s are random integers in $[0, 2^K)$ and the string characters $s_i$ are integers
in $[0, 2^L)$. These two families are strongly universal over fixed-length strings, or over
variable-length strings that do not end with the zero character. We can apply the second
family to strings of odd length by appending an extra zero element so that all strings
have an even length.

Proof. We begin with the first family (MULTILINEAR). Given any two distinct strings
s and s', consider the equations h(s) = y and h(s') = y' for any two hash values y
and y'. Without loss of generality, we can assume that the strings have the same length.
If not, we can pad the shortest string with zeros without changing its hash value. We
need to show that $P(h(s) = y \wedge h(s') = y') = 2^{-2(K-L+1)}$. Because the two strings
are distinct, we can find j such that $s_j \ne s'_j$. Without loss of generality, assume that
$s'_j - s_j \in [0, 2^L)$.

We want to solve the equations

$$\left(\left(m_1 + \sum_{i=1}^{n} m_{i+1} s_i\right) \bmod 2^K\right) \div 2^{L-1} = y, \qquad (2)$$
$$\left(\left(m_1 + \sum_{i=1}^{n} m_{i+1} s'_i\right) \bmod 2^K\right) \div 2^{L-1} = y' \qquad (3)$$

for integers $m_1, m_2, \ldots$ in $[0, 2^K)$.
Consider the following equation

$$\left(m_1 + \sum_{i=1}^{n} m_{i+1} s_i\right) \bmod 2^K = z.$$

There is a bijection between $m_1$ and $z \in [0, 2^K)$. That is, for every value of $m_1$, there
is a unique z, and vice versa. Specifically, we have

$$m_1 = \left(z - \sum_{i=1}^{n} m_{i+1} s_i\right) \bmod 2^K.$$

If we choose z such that $z \div 2^{L-1} = y$, we effectively solve Equation 2. By substitution
in Equation 3, we have

$$\left(m_{j+1}(s'_j - s_j) + z + \sum_{i \ne j} m_{i+1}(s'_i - s_i) \bmod 2^K\right) \div 2^{L-1} = y'.$$

This equation is independent of $m_1$. By Proposition 1, there are exactly $2^{L-1}$ solutions
$m_{j+1}$ to this last equation. (Indeed, in the statement of Proposition 1, substitute $m_{j+1}$
for x, $s'_j - s_j$ for a, $z + \sum_{i \ne j} m_{i+1}(s'_i - s_i) \bmod 2^K$ for c and y' for b.)

Meanwhile, there are $2^{L-1}$ possible values z such that $z \div 2^{L-1} = y$. Because
there is a bijection between $m_1$ and z, there are also $2^{L-1}$ possible values for $m_1$.

So, focusing only on $m_1$ and $m_{j+1}$, there are $2^{L-1} \times 2^{L-1}$ values satisfying h(s) = y
and h(s') = y'. Yet there are $2^K \times 2^K$ possible pairs $m_1, m_{j+1}$. Thus the probability
that h(s) = y and h(s') = y' is $\frac{2^{L-1} \times 2^{L-1}}{2^K \times 2^K} = 2^{-2(K-L+1)}$, which completes the proof
for the first family.

The proof that the second family (MULTILINEAR-HM) is strongly universal is
similar. As before, set z in $[0, 2^K)$ such that $z \div 2^{L-1} = y$. Solve for $m_1$ from the first
equation:

$$m_1 = z - \sum_{i=1}^{n/2} (m_{2i} + s_{2i-1})(m_{2i+1} + s_{2i}) \bmod 2^K.$$

Then by substitution, we get

$$\left(\left(\sum_{i=1}^{n/2} \left[(m_{2i} + s'_{2i-1})(m_{2i+1} + s'_{2i}) - (m_{2i} + s_{2i-1})(m_{2i+1} + s_{2i})\right] + z\right) \bmod 2^K\right) \div 2^{L-1} = y'.$$

We can rewrite this last equation as either
$((m_j(s'_j - s_j) + \rho + z) \bmod 2^K) \div 2^{L-1} = y'$ if j is even, or as
$((m_{j+2}(s'_j - s_j) + \rho + z) \bmod 2^K) \div 2^{L-1} = y'$ if j is odd, where $\rho$ is
independent of either $m_j$ (j even) or $m_{j+2}$ (j odd). As before, by Proposition 1, there
are exactly $2^{L-1}$ solutions for $m_j$ (j even) or $m_{j+2}$ (j odd) if z is fixed. As before,
there are $2^{L-1}$ distinct possible values for z, and $2^{L-1}$ distinct corresponding values
for $m_1$. Hence, the pair $m_1, m_j$ (or $m_1, m_{j+2}$ when j is odd) can take
$2^{L-1} \times 2^{L-1}$ distinct values out of $2^K \times 2^K$ values, which completes the proof. □
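For very small parameters, the statement of Theorem 1 can also be checked by brute force: for every pair of distinct strings, each pair of hash values (y, y') must be produced by the same number of key choices. The sketch below is only a sanity check under the assumed parameters K = 3, L = 2, n = 2; it is not a substitute for the proof, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product

def multilinear(m, s, K, L):
    """((m_1 + sum_i m_{i+1} s_i) mod 2^K) div 2^(L-1); m[0] plays the role of m_1."""
    acc = m[0] + sum(mi * si for mi, si in zip(m[1:], s))
    return (acc % (1 << K)) >> (L - 1)

def multilinear_hm(m, s, K, L):
    """Half-multiplication variant; assumes len(s) is even."""
    acc = m[0]
    for i in range(0, len(s), 2):
        acc += (m[i + 1] + s[i]) * (m[i + 2] + s[i + 1])
    return (acc % (1 << K)) >> (L - 1)

def is_strongly_universal(h, K, L, n):
    """Enumerate all keys and all pairs of distinct length-n strings."""
    keys = list(product(range(1 << K), repeat=n + 1))
    strings = list(product(range(1 << L), repeat=n))
    values = 1 << (K - L + 1)                 # number of hash values
    expected = len(keys) // (values * values)  # keys per (y, y') pair
    for idx, s in enumerate(strings):
        for t in strings[idx + 1:]:
            counts = Counter((h(m, s, K, L), h(m, t, K, L)) for m in keys)
            if len(counts) != values * values or set(counts.values()) != {expected}:
                return False
    return True

assert is_strongly_universal(multilinear, K=3, L=2, n=2)
assert is_strongly_universal(multilinear_hm, K=3, L=2, n=2)
```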

To apply Theorem 1 to variable-length strings, we can append the character value
one to all strings so that they never end with the character value zero, as in § 2. If we use
MULTILINEAR-HM, we should add the character value one before padding odd-length
strings to an even length.

Theorem 1 is both more general (because it includes strings) and more specific
(because the cardinality of the set of hash values is a power of two) than a similar
result by Dietzfelbinger [13, Theorem 4]. However, we believe our proof is more
straightforward: we mostly use elementary mathematics.

While Dietzfelbinger did not consider the multilinear case, others proposed
variations suited to string hashing. Pătraşcu and Thorup [24] state without proof that
MULTILINEAR-HM over strings of length two is strongly universal for K = 64, L = 32.
They extend this approach to strings, taking characters two by two:



K-"^) = I I G3(^"3i-2 + S2i-i)(m3j_i + S2i) + nizi I mod 2^ 



where $\oplus$ is the bitwise exclusive-or operation and n is even. Unfortunately, their
approach uses more operations and requires 50% more random numbers than
MULTILINEAR-HM. They also refer to an earlier reference [25] where a similar scheme was
erroneously described as universal, and presented as folklore:



^('') = ^("^21+1 + S2i+i){m2i+2 + S2i+2) mod 2^ 





where n is even. To falsify the universality of this last family, we can verify numerically
that for K = 6, L = 3, the strings 0,0 and 2,6 collide with probability
> p^. In any case, we see no benefit to this last approach for long strings because 
MULTILINEAR-HM is likely just as fast, and it is strongly universal. 

3.1. Implementing MULTILINEAR

If 32-bit values are required, we can generate a large buffer of 64-bit unsigned
random integers $m_i$. The computation of either

$$h(s) = \left(\left(m_1 + \sum_{i=1}^{n} m_{i+1} s_i\right) \bmod 2^{64}\right) \div 2^{32}$$

or

$$h(s) = \left(\left(m_1 + \sum_{i=1}^{n/2} (m_{2i} + s_{2i-1})(m_{2i+1} + s_{2i})\right) \bmod 2^{64}\right) \div 2^{32}$$

is then a simple matter using unsigned integer arithmetic common to most modern
processors. The division by $2^{32}$ can be implemented efficiently by a right shift (>>32).
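The two 32-bit formulas can be sketched as follows. Python integers are unbounded, so the sketch emulates unsigned 64-bit wrap-around with an explicit mask; the names `random_keys`, `multilinear32`, `multilinear_hm32` and `prepare` are hypothetical, and the use of the `secrets` module as the source of random numbers is an assumption of this illustration.

```python
import secrets

MASK64 = (1 << 64) - 1  # emulates arithmetic modulo 2^64

def random_keys(n):
    """n + 1 random 64-bit integers, enough for strings of up to n characters."""
    return [secrets.randbits(64) for _ in range(n + 1)]

def multilinear32(keys, s):
    """((m_1 + sum_i m_{i+1} s_i) mod 2^64) div 2^32 for 32-bit characters."""
    if len(keys) < len(s) + 1:
        raise ValueError("not enough random values for this string")
    acc = keys[0]
    for m, c in zip(keys[1:], s):
        acc = (acc + m * c) & MASK64
    return acc >> 32

def multilinear_hm32(keys, s):
    """Variant with half the multiplications; the string length must be even."""
    if len(s) % 2 or len(keys) < len(s) + 1:
        raise ValueError("need an even-length string and len(s) + 1 random values")
    acc = keys[0]
    for i in range(0, len(s), 2):
        acc = (acc + (keys[i + 1] + s[i]) * (keys[i + 2] + s[i + 1])) & MASK64
    return acc >> 32

def prepare(s):
    """Variable-length strings: append one, then pad with zero to an even length."""
    s = list(s) + [1]
    if len(s) % 2:
        s.append(0)
    return s

keys = random_keys(4)
data = prepare([0xDEADBEEF, 42, 7])
print(multilinear32(keys, data), multilinear_hm32(keys, data))
```

In a language with native unsigned 64-bit integers, the masking disappears and each loop iteration reduces to one multiplication and one or two additions.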

One might object that according to Theorem 1, 63-bit random numbers are
sufficient if we wish to hash 32-bit characters to a 32-bit hash value. The division by $2^{32}$
should then be replaced by a division by $2^{31}$. However, we feel that such an
optimization is unlikely to either save memory or improve speed.

Multilinear is essentially an inner product and thus can benefit from multiply- 
accumulate CPU instructions: by processing the multiplication and the subsequent ad- 
dition as one machine operation, the processor may be able to do the computation faster 
than if the computations were done separately. Several processors have such integer 
multiply-accumulate instructions (ARM, MIPS, Nvidia and PowerPC). Comparatively, 
we do not know of any multiply-xor-accumulate instruction in popular processors. 

Unfortunately, some languages, such as Java, fail to support unsigned integers.
With a two's complement representation, the de facto standard in modern processors,
additions and multiplications give identical results, up to overflow flags, as long as no
promotion is involved: e.g., multiplying 32-bit integers using 32-bit arithmetic, or
64-bit integers using 64-bit arithmetic. However, we must still be careful: promotions and
divisions differ when we use signed integers:

• If we store string characters using 32-bit integers (int) and random values as
64-bit integers (long), Java will sign-extend the 32-bit integer to a 64-bit
integer when computing m_{i+1} * s_i, giving an unintended result for negative string
characters. Use m_{i+1} * (s_i & 0xFFFFFFFFL) instead.

• Unsigned and signed divisions differ. Correspondingly, for the division by $2^{32}$
(to retrieve the 32 most significant bits), the unsigned right-shift operator (>>>)
must be used in Java, and not the regular right shift (>>).

Because we assume that the number of bits is a constant, the computational
complexity of MULTILINEAR is linear (O(n)). MULTILINEAR uses n multiplications,
n additions, and one shift, whereas MULTILINEAR-HM uses n/2 multiplications,
3n/2 additions, and one shift. In both cases we use 2n + 1 operations, although there
may be benefits to having fewer multiplications. (Admittedly, Single Instruction,
Multiple Data (SIMD) processors can do several instructions at once.)

Consider that we need at least $\approx 32(n + 1)$ random bits for strongly universal
32-bit hashing of 32n bits [15]. That is, we must aggregate $\approx 64n + 32$ bits into a 32-bit
hash value. Assume that we only allow unary and binary operations. A 32-bit binary
operation maps 64 bits to 32 bits, a reduction of 32 bits. Hence, we require at least
2n 32-bit operations for strongly universal hashing. Alternatively, we require at least
n 64-bit operations. Hence, for n large, MULTILINEAR and MULTILINEAR-HM use
at most twice the minimal number of operations.

3.2. Word size optimization 

The number of required bits is application dependent: for a hash table, one may
be able to bound the maximum table size. In several languages such as Java,
32-bit hash values are expected. Meanwhile the key parameters of our hash functions
MULTILINEAR and MULTILINEAR-HM are L (the size of characters) and K (the size
of the operations), and these two hash functions deliver K − L + 1 usable bits.

However, both K and L can be adjusted given a desired number of usable random
bits. Indeed, a length n string of L-bit characters can be reinterpreted as a length
$n \lceil L/L' \rceil$ string of L'-bit characters, for any non-zero L'. Thus, we can either grow L
and K or reduce L and K, for the same number of usable bits.

To reduce the need for random bits, we should use large values of K. Consider a
long input string that we can represent as a string of 32-bit or 96-bit characters. Assume
we want 32-bit hash values. Assume also that our random data only comes in strings
of 64-bit or 128-bit characters. If we process the string as a 32-bit string, we require
64 random bits per character. The ratio of random strings to hashed strings is two. If
we process the string as a 96-bit string, we require 128 random bits per character and
the ratio of random strings to hashed strings is $128/96 = 4/3 \approx 1.33$. What if we
could represent the string using 224-bit characters and have random bits packaged into
characters of 256 bits? We would then have a ratio of $8/7 \approx 1.14$.

We can formalize this result. Suppose we require z pairwise independent bits and
that we have M input bits. Stinson [15] showed that this requires at least
$1 + 2^M(2^z - 1)$ hash functions or, equivalently, $\log(1 + 2^M(2^z - 1))$ random bits. Thus, given any
hashing family, the ratio of its required number of random bits to the Stinson limit
(henceforth Stinson ratio) must be greater than or equal to one. The M input bits can be
represented as an L-bit n-character string for M = nL. Under MULTILINEAR (and
MULTILINEAR-HM), we must have z = K − L + 1. Thus we use
$K(n + 1) = (z + L - 1)(\lceil M/L \rceil + 1)$ random bits. We have that
$(z + L - 1)(\lceil M/L \rceil + 1) \le (z + L - 1)(M/L + 2)$ which is minimized when

$$L = \sqrt{(z - 1)M/2}.$$

Rounding $L = \sqrt{(z - 1)M/2}$ up and substituting it back into $(z + L - 1)(\lceil M/L \rceil + 1)$,
we get an upper bound on the number of random bits required by MULTILINEAR. This
bound is nearly optimal when $\lceil M/L \rceil \approx M/L$, that is, when M is large.
Unfortunately, this estimate fails to consider that word sizes are usually prescribed. For
example, we could be required to choose $K \in \{8, 16, 32, 64\}$. That is, we have to
choose $L \in \{9 - z, 17 - z, 33 - z, 65 - z\}$. Fig. 1 shows the corresponding Stinson
ratios. When there are many input bits ($M \gg 1$), the ratio of MULTILINEAR
converges to one. That is, as long as we can decompose input data into strings of large
characters (having $\approx \sqrt{(z - 1)M/2}$ bits), MULTILINEAR requires almost a minimal
number of bits. This may translate into an optimal memory usage. (The result also
holds for MULTILINEAR-HM except that it is slightly less efficient for strings having
an odd number of characters.) If we restrict the word sizes to common machine word
sizes ($K \in \{8, 16, 32, 64\}$), the ratio is $\approx 2$ for large input strings. We also consider
the case where we could use 128-bit words ($K \in \{8, 16, 32, 64, 128\}$). It improves the
ratio noticeably ($\approx 1.33$), as expected.

Figure 1: For large inputs, MULTILINEAR requires an almost optimal number of
random bits when arbitrary word sizes (K) are allowed. It has lower efficiency when the
word size is constrained. The plot was generated for 32-bit hash values (z = 32).
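The Stinson ratio is straightforward to evaluate numerically. The sketch below assumes a base-2 logarithm for the Stinson limit and an arbitrary input size of $M = 10^6$ bits; the function names are hypothetical.

```python
import math

def stinson_bits(M, z):
    """Lower bound on random bits: log2(1 + 2^M (2^z - 1))."""
    return math.log2(1 + (1 << M) * ((1 << z) - 1))

def multilinear_bits(M, z, L):
    """Random bits used by MULTILINEAR: K (n + 1) with K = z + L - 1, n = ceil(M / L)."""
    K = z + L - 1
    n = -(-M // L)  # ceiling division
    return K * (n + 1)

def stinson_ratio(M, z, L):
    return multilinear_bits(M, z, L) / stinson_bits(M, z)

z, M = 32, 10**6
L_free = math.ceil(math.sqrt((z - 1) * M / 2))  # unconstrained character size

ratio_free = stinson_ratio(M, z, L_free)  # arbitrary word size: close to 1
ratio_64 = stinson_ratio(M, z, 33)        # K = 64: close to 2
ratio_128 = stinson_ratio(M, z, 97)       # K = 128: close to 1.33
print(ratio_free, ratio_64, ratio_128)
```

With these inputs the three ratios come out near 1.01, 1.94 and 1.32, in line with the values quoted in the text.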

We can also choose the word size (K) to optimize speed. On a 64-bit processor,
setting K = 64 would make sense. We can compare this default with two alternatives:

1. We can try to support much larger words using fast multiplication algorithms
   such as Karatsuba's. We could merely try to minimize the number of random
   bits. However, this ignores the growing computational cost of multiplications over
   many bits, e.g., Karatsuba's algorithm is in $\Omega(n^{1.5})$. For simplicity, suppose
   that a K-bit multiplication costs $K^a$ time for a > 1. Roughly speaking,
   to hash M bits, we require $\lceil M/L \rceil$ multiplications. When we have long strings
   ($M \gg L$), we can simplify $\lceil M/L \rceil \approx M/L$. If we desire z-bit hash values,
   then we need to use multiplications over K = z + L − 1 bits. Thus, the processing
   cost can be (roughly) approximated as $M (z + L - 1)^a / L$. Starting from L = 1, this
   function initially decreases to a minimum at

   $$L = \frac{z - 1}{a - 1}$$

   before increasing again. (When a = 1.5 and z = 32, we have $\frac{z-1}{a-1} = 62$.)
   See Fig. 2. Hence, while we can minimize the total number of random bits
   by using many bits per character (L large), we may want to keep L relatively
   small to take into account the superlinear cost of multiplications.

2. We can support 128-bit words on a 64-bit processor, with some overhead.
   (Recent GNU GCC compilers have the __uint128_t type, as a C-language extension.)
   A single 128-bit multiplication may require up to three 64-bit multiplications.
   However, it processes more data: with z = 32 hashed bits, each 128-bit
   multiplication hashes 97 input bits. Comparatively, setting K = 64, we require
   a single 64-bit multiplication, but we process only 32 bits of data. (Formally,
   we could process 33 bits of data, but for convenient implementation, we process
   data in powers of two.) Hence, it is unclear which approach is faster: three 64-bit
   multiplications and 128 bits of random data to process 97 input bits, or a single
   64-bit multiplication and 64 bits of random data, to process 33 input bits.
   However, the 128-bit approach will use 33% fewer random bits. Going to 256-bit
   word sizes would only reduce the number of random bits by 14%: using larger
   and larger words leads to diminishing returns.

Figure 2: Modeled computational cost per bit as a function of the number of bits per
character for 32-bit hashing values (z = 32) and a = 1.5.

We assess these two alternatives experimentally in § 5.5.
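The location of the minimum in the cost model of the first alternative can be checked numerically. This is a small sketch of the model only (cost per input bit, that is, the expression above divided by M); the function name is hypothetical and the search range of 1 to 2048 bits per character is an assumption.

```python
def modeled_cost_per_bit(L, z=32, a=1.5):
    """Cost model (z + L - 1)^a / L: one K-bit multiplication per L input bits."""
    return (z + L - 1) ** a / L

# Search over integer character sizes for z = 32 and a = 1.5.
best_L = min(range(1, 2049), key=modeled_cost_per_bit)
print(best_L)  # 62, matching (z - 1) / (a - 1)
```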



4. Fast Multilinear with carry-less multiplications 

To help support fast operations over binary finite fields (GF($2^L$)), AMD and
Intel introduced the Carry-less Multiplication (CLMUL) instruction set [26]. Given the
binary representations of two numbers, $a = \sum_{i=1}^{L} a_i 2^{i-1}$ and $b = \sum_{i=1}^{L} b_i 2^{i-1}$, the
carry-less multiplication is given by $c = \sum_{i=1}^{2L-1} c_i 2^{i-1}$ where
$c_i = \bigoplus_{j=1}^{i} a_j b_{i-j+1}$ (taking $a_j = b_j = 0$ for $j > L$).
(We write $a \star b = c$.) If we represent the two L-bit integers a and b as polynomials
in GF(2)[x], then the carry-less multiplication is equivalent to the usual polynomial
multiplication:

$$\left(\sum_{i=1}^{L} a_i x^{i-1}\right) \left(\sum_{i=1}^{L} b_i x^{i-1}\right) = \sum_{i=1}^{2L-1} c_i x^{i-1}.$$

With a fast carry-less computation, we can compute Multilinear efficiently. Given
any irreducible polynomial p(x) of degree L, the field GF(2)[x]/p(x) is isomorphic to
GF($2^L$). Hence, we want to compute $h(s) = m_1 + \sum_{i=1}^{n} m_{i+1} s_i$ over GF(2)[x]/p(x).
(Similarly, we can use Equation 1 to reduce the number of multiplications by half.)
Computing all multiplications over GF(2)[x]/p(x) would still be expensive given fast
carry-less multiplication. Instead, we first compute $m_1 + \sum_{i=1}^{n} m_{i+1} s_i$ over GF(2)[x]
and then return the remainder of the division of the final result by p(x). Indeed, think of
the values $m_1, m_2, \ldots$ and $s_1, s_2, \ldots$ as polynomials of degree at most L in GF(2)[x].
Each of the n multiplications in GF(2)[x] is equivalent to a carry-less multiplication
over L-bit integers which results in a (2L − 1)-bit value. Similarly, each of the n additions
in GF(2)[x] is an exclusive-or operation. That is, we want to compute the (2L − 1)-bit
integer

$$h(s) = m_1 \oplus \bigoplus_{i=1}^{n} m_{i+1} \star s_i. \qquad (6)$$




Finally, considering h(s) as an element of GF(2)[x], noted q(x), we must compute
q(x)/p(x). The remainder (as an L-bit integer) is the final hash value h(s).

If done naively, computing the remainder of the division by an irreducible
polynomial may remain relatively expensive, especially for short strings since they require
few multiplications. A common technique to quickly compute the remainder is the
Barrett reduction algorithm [27]. The adaptation of this reduction to GF(2)[x] is
especially convenient [28] when we choose the irreducible polynomial p(x) such that
$\mathrm{degree}(p(x) - x^L) \le L/2$, that is, when we can write it as
$p(x) = x^L + \sum_{i=0}^{\lfloor L/2 \rfloor} a_i x^i$.
(There are such irreducible polynomials for $L \in \{1, 2, \ldots, 400\}$ [29] and we
conjecture that such a polynomial can be found for any L [30].) In this case, the remainder of
q(x)/p(x) is given by

$$((((q \div 2^L) \star p) \div 2^L) \star p) \oplus q) \bmod 2^L$$

where q and p are the (2L − 1)-bit and (L + 1)-bit integers representing q(x) and p(x).
(See Appendix B for implementation details.) We expect the two carry-less
multiplications to account for most of the running time of the reduction.
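The reduction formula can be compared against plain polynomial long division. The following sketch emulates the carry-less multiplication in software (it does not use the CLMUL instructions) and assumes L = 8 with $p(x) = x^8 + x^4 + x^3 + x + 1$, which satisfies $\mathrm{degree}(p(x) - x^L) \le L/2$; all function names are hypothetical.

```python
def clmul(a, b):
    """Software carry-less product of two non-negative integers."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def poly_mod(q, p):
    """Reference remainder of q(x) / p(x) by schoolbook long division over GF(2)."""
    deg_p = p.bit_length() - 1
    while q.bit_length() > deg_p:
        q ^= p << (q.bit_length() - 1 - deg_p)
    return q

def barrett_mod(q, p, L):
    """((((q div 2^L) * p) div 2^L) * p) xor q) mod 2^L, with * carry-less."""
    t = clmul(q >> L, p) >> L
    return (clmul(t, p) ^ q) & ((1 << L) - 1)

def gf_multilinear(keys, s, p, L):
    """Equation 6 followed by a single reduction modulo p(x)."""
    acc = keys[0]
    for m, c in zip(keys[1:], s):
        acc ^= clmul(m, c)
    return barrett_mod(acc, p, L)

L, P = 8, 0x11B  # x^8 + x^4 + x^3 + x + 1 (illustrative choice)

# The formula agrees with long division for every (2L - 1)-bit input.
assert all(barrett_mod(q, P, L) == poly_mod(q, P) for q in range(1 << (2 * L - 1)))

print(hex(gf_multilinear([0x53, 0xCA, 0x02], [0x01, 0x80], P, L)))  # 0x82
```

The design point illustrated here is that the reduction happens once per string rather than once per multiplication, so its cost is amortized over long inputs.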

Unfortunately, in its current Intel implementation, carry-less multiplications have
significantly reduced throughput compared to regular integer multiplications. Indeed,
with pipelining, it is possible to complete one regular multiplication per cycle, but only
one carry-less multiplication every 8 cycles [31]. However, using a result from § 2, we
can reduce the number of multiplications by half if we compute

$$h(s) = m_1 \oplus \bigoplus_{i=1}^{n/2} (m_{2i} + s_{2i-1}) \star (m_{2i+1} + s_{2i})$$

instead. (Henceforth, we refer to the last variation as GF MULTILINEAR-HM whereas we
refer to the version based on Equation 6 as GF MULTILINEAR.) Yet even a fast
implementation of Barrett reduction will still be much slower than merely selecting the left-most
L bits as in MULTILINEAR.

However, the carry-less approach might still be preferable to the schemes described
in § 3 (e.g., MULTILINEAR) because fewer random bits are required. Indeed, to generate
L-bit hash values from n-character strings, the carry-less scheme uses (n + 1)L random
bits, whereas MULTILINEAR requires 2L + n(2L − 1) random bits.
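To make the comparison concrete, the two counts can be evaluated for the string length used in the experiments (n = 1024 characters, L = 32). The sketch below simply evaluates the two expressions stated above. The function names are hypothetical.

```c
#include <stdint.h>

/* Random bits needed by the carry-less scheme: (n + 1) L. */
static uint64_t random_bits_carryless(uint64_t n, uint64_t L) {
    return (n + 1) * L;
}

/* Random bits needed by MULTILINEAR, as stated in the text: 2L + n(2L - 1). */
static uint64_t random_bits_multilinear(uint64_t n, uint64_t L) {
    return 2 * L + n * (2 * L - 1);
}
```

With n = 1024 and L = 32, these give 32800 and 64576 bits respectively, so the carry-less scheme needs roughly half as many random bits.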



5. Experiments 

Our experiments show the following results:

• It is best to implement MULTILINEAR with loop unrolling. With this optimization,
MULTILINEAR is just as fast (on Intel processors) as MULTILINEAR-HM,
even though it has twice the number of multiplications. In general, processor
microarchitectural differences are important in determining which method
is fastest. (§ 5.2)

• In the absence of processor support for carry-less multiplication (see § 4), hashing
using MULTILINEAR over binary finite fields is an order of magnitude slower
than MULTILINEAR even when using a highly optimized library. (§ 5.3)

• Even with hardware support for carry-less multiplication, hashing using MULTILINEAR
over binary finite fields remains nearly an order of magnitude slower than
MULTILINEAR. (§ 5.4)

• Given a 64-bit processor, it is noticeably faster to use a word size of 64 bits
even though a larger word size (128 bits) uses fewer random bits (33% less).
Use of multiprecision arithmetic libraries can further reduce the overhead from
accessing random bits, but they are also not competitive with respect to speed,
though they can halve the number of required random bits. (§ 5.5)

• MULTILINEAR is generally faster than popular string-hashing algorithms. (§ 5.6)



5.1. Experimental setup

We evaluated the hashing functions on the platforms shown in Table 1. Our software
is freely available online [32]. For Intel and AMD processors, we used the
processor's time stamp counter (rdtsc instruction [33]) to estimate the number of cycles
required to hash each byte. Unfortunately, the ARM instruction set does not provide
access to such a counter. Hence, for ARM processors (Apple A4 and Nvidia Tegra),
we estimated the number of cycles required by multiplying the wall-clock time by the
documented processor clock rate (1 GHz).

Table 1: Platforms used.

Processor              Bits   GCC version         Flags, besides -O3 -funroll-loops
64-bit processors
Intel Core 2 Duo       64     GNU GCC 4.6.2       -march=core2 -mno-sse2
Intel Xeon X5260       64     GNU GCC 4.1.2       -march=nocona
Intel Core i7-860      64     GNU GCC 4.6.2       -march=corei7 -mno-sse2
Intel Core i7-2600     64     GNU GCC 4.6.2       -march=corei7 -mno-sse2
Intel Core i7-2677M    64     GNU GCC 4.6.2       -march=corei7 -mno-sse2
AMD Sempron 3500+      64     GNU GCC 4.4.3       -march=k8 -mno-sse2
AMD V120               64     GNU GCC 4.4.3       -march=amdfam10 -mno-sse2
32-bit processors
Intel Atom N270        32     GNU GCC 4.5.2       -march=atom
Apple A4               32     GNU GCC 4.2.1       -march=armv7
Nvidia Tegra 2         32     GNU GCC 4.4.3 (a)
VIA Nehemiah           32     GNU GCC 3.3.4       -march=i686

(a) From the Android NDK, configured for the android-9 platform, and used on a Motorola XOOM.
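The wall-clock estimate used for the ARM processors amounts to a one-line conversion: elapsed seconds times the clock rate gives cycles, which is then divided by the number of bytes hashed. The sketch below is only an illustration of that arithmetic under stated assumptions. It uses the C11 function `timespec_get` as the wall-clock source, and the helper names are hypothetical rather than taken from the authors' benchmark code.

```c
#include <stddef.h>
#include <time.h>

/* Current wall-clock time in seconds (C11 timespec_get). */
static double wall_seconds(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Estimated cycles per byte: elapsed time times clock rate, per byte hashed. */
static double cycles_per_byte(double seconds, double clock_hz, size_t bytes) {
    return seconds * clock_hz / (double)bytes;
}
```

A typical use would record `wall_seconds()` before and after hashing the same 4 kB string many times, divide the difference by the repetition count, and pass the result to `cycles_per_byte` with `clock_hz` set to 1e9 for a 1 GHz processor.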

For the 64-bit machines, 64-bit executables were produced and all operations were 
executed using 64-bit arithmetic except where noted. All timings were repeated three 
times. For the 32-bit processors, 32-bit operations were used to process 16-bit strings. 
Therefore, results between 32- and 64-bit processors are not directly comparable. Good 
optimization flags were found by a trial-and-error process. We note that using profile- 
guided optimizations did not improve this code any more than simply enabling loop 
unrolling (-funroll-loops). With (only) versions 4.4 and higher of GCC, it was important
for speed to forbid use of SSE2 instructions when compiling MULTILINEAR and
MULTILINEAR-HM.

We found that the speed is insensitive to the content of the string: in our tests we 
hashed randomly generated strings. We reuse the same string for all tests. Unless 
otherwise specified, we hash randomly generated 32-bit strings of 1024 characters. 

In addition to MULTILINEAR and MULTILINEAR-HM we also implemented
MULTILINEAR (2-by-2) which is merely MULTILINEAR with 2-by-2 loop unrolling (see
Appendix A for representative C implementations).

Our timings should represent the best possible performance: the performance of a
function may degrade [21] when it is included in an application because of bandwidth
and caching.

5.2. Reducing the multiplications or unrolling may fail to improve the speed 

We ran our experiments over both the 32-bit and 64-bit processors. For the 32-bit
processors, we generated both 16-bit and 32-bit hash values. Our experimental results
are given in Table 2.

Table 2: Estimated CPU cycles per byte for fast Multilinear hashing

                       MULTILINEAR   2-by-2   MULTILINEAR-HM
64-bit processors and 32-bit hash values and characters
Intel Core 2 Duo       0.54          0.52     0.52
Intel Xeon X5260       0.50          0.50     0.50
Intel Core i7-860      0.42          0.42     0.42
Intel Core i7-2600     0.34          0.27     0.28
Intel Core i7-2677M    0.25          0.20     0.20
AMD Sempron 3500+      0.63          0.60     0.40
AMD V120               0.63          0.63     0.40
64-bit arithmetic and 32-bit hash values and characters on 32-bit processors
Intel Atom N270        4.2           4.2      3.6
Apple A4               3.0           2.7      3.3
Nvidia Tegra 2         3.3           3.0      4.9
VIA Nehemiah           12            12       8.2
32-bit processors and 16-bit hash values and characters
Intel Atom N270        2.1           3.5      2.6
Apple A4               1.9           2.6      1.7
Nvidia Tegra 2         1.8           2.2      1.9
VIA Nehemiah           5.2           5.2      3.6

We see that over 64-bit Intel processors (except for the i7-2600), there is little
difference between MULTILINEAR, MULTILINEAR (2-by-2) and MULTILINEAR-HM,
even though MULTILINEAR and MULTILINEAR (2-by-2) have twice the number of
multiplications. We believe that Intel processors use aggressive pipelining techniques
well suited to these computations. On AMD processors, MULTILINEAR-HM is the
clear winner, being at least 33% faster.

For the 32-bit processors, we get vastly different results depending on
whether we generate 16-bit or 32-bit hash values.

• As expected, it is roughly twice as expensive to generate 32-bit hash values as
to generate 16-bit values.

• For the VIA processor, MULTILINEAR-HM is 45% faster than MULTILINEAR
and MULTILINEAR (2-by-2). We suspect that the computational cost is tightly
tied to the number of multiplications.

• When the 32-bit ARM-based processors generate 32-bit hash values, MULTILINEAR
(2-by-2) is preferable. We are surprised that MULTILINEAR-HM is
the worst choice. We believe that this is related to the presence of a multiply-accumulate
instruction in ARM processors. When generating 16-bit hash values,
MULTILINEAR (2-by-2) becomes the worst choice. There is no significant benefit
to using MULTILINEAR-HM as opposed to MULTILINEAR.

• The Intel Atom processor benefits from MULTILINEAR-HM when generating
32-bit hash values, but MULTILINEAR is preferable to generate 16-bit hash values.
As with the ARM-based processors, MULTILINEAR (2-by-2) is a poor choice for
generating 16-bit hash values.



5.3. Binary-finite-field libraries are not competitive 

We obtained the mpFq library from INRIA. This code is reported [34] to be generally
faster than popular alternatives such as NTL and Zen, and our own tests found it
to be more than twice as fast as Plank's library [35].

We computed MULTILINEAR in GF(2^32), using the version with half the number of
multiplications (see Equation 1) because the library does much more work in multiplication
than addition. Even so, on our Core 2 Duo, hashing 32-bit strings of 1024 characters
was an order of magnitude slower than MULTILINEAR: averaged over a million
attempts, the code using mpFq required an average of 7.69 μs per string, compared with
0.78 μs for MULTILINEAR. While our implementation of MULTILINEAR uses twice as
many random bits as MULTILINEAR in GF(2^32), this gain is offset by the memory usage
of the finite-field library.



5.4. Hardware-supported carry-less multiplications are not fast enough 

Intel reports a throughput of one carry-less product every 8 cycles [31] on a processor
such as the Intel Core i7-2600. Consider GF MULTILINEAR-HM: it uses one carry-less
multiplication for every two 32-bit characters. Hence, it requires at least 4 cycles to
process each character. Therefore, in the best scenario possible, GF MULTILINEAR-HM
will be four times slower than MULTILINEAR-HM which requires only 1.1 cycles per
32-bit character (0.28 cycle per byte).

To assess the actual performance, we implemented both GF MULTILINEAR and
GF MULTILINEAR-HM in C (see Appendix B). Of the processors we tested, only
the i7-2600 has support for the CLMUL instruction set. If we use the flags -O3
-funroll-loops -Wall -maes -msse4 -mpclmul, we get 12.5 CPU cycles
per 32-bit character with GF MULTILINEAR and only 7.2 CPU cycles with GF
MULTILINEAR-HM. We might be able to improve our implementation: e.g., we expect
that much time is spent loading data into XMM registers. However, the throughput of
the carry-less multiplication limits the character throughput of GF MULTILINEAR and
GF MULTILINEAR-HM to 8 and 4 cycles per character, respectively. On the bright side,
GF MULTILINEAR and GF MULTILINEAR-HM require half the number of random bits.



5.5. The sweet-spot for multiprecision arithmetic is not sweet enough 

To implement the techniques of § 3.2 we used the GMP library [36] version 5.0.2
to implement MULTILINEAR (2-by-2). As usual, we are hashing 4 kB of data, though
data to be hashed are read in large chunks (up to 2048 bits). The hash output is always
32 bits (z = 32). Results show a benefit as the chunk size L goes from 32 to 512
bits, but thereafter the situation degrades. See Fig. 3. In the best case, using 512-bit
arithmetic, we require almost 13 μs per string on the Core 2 Duo platform. For
comparison, we find that the fewest random bits would be needed when L = 1024
(§ 3.2). As expected, the running time is minimized for a lower value of L to account
for the superlinear running cost of multiplications.

Unfortunately, we can do 12 times better without the GMP library (0.78 μs for 64-bit
MULTILINEAR) so it is not practical to use 512-bit arithmetic, even though it uses
fewer random bits (nearly half as many).

As a lightweight alternative to a multiprecision library, we experimented with the
__uint128_t type provided as a GCC extension for 64-bit machines. We used 128-bit
random numbers and processed three 32-bit words with each 128-bit operation. Since
__uint128_t multiplications are more expensive than __uint128_t additions, we tested the
MULTILINEAR-HM scheme. On our Core 2 Duo machine, the result was 38% slower
than MULTILINEAR (2-by-2) using 64-bit operations. This poor result is mitigated by
the fact that we use 128 random bits per 96 input bits, versus 64 random bits per 32 input
bits (a saving of nearly 33% for long strings). Investigation using hardware performance
counters showed many "unaligned loads" from retrieving 128-bit quantities
when we step through memory with 96-bit steps. To reduce this, we tried processing
only two 32-bit words with each 128-bit operation, since we then retrieved aligned 64-bit
quantities. However, the result was 61% slower than MULTILINEAR (2-by-2) using
64-bit operations.

5.6. Strongly universal hashing is inexpensive? 

In a survey, Thorup [1] concluded that strongly universal hash families are just as 
efficient, or even more efficient, than popular hash functions with weaker theoretical 
guarantees. However, he only considered 32-bit integer inputs. We consider strings. 

In Table 3, we compare the fastest MULTILINEAR (MULTILINEAR-HM) with two
non-universal fast 32-bit string hash functions, Rabin-Karp [37] and SAX [38]. (They are
similar to hash functions found in programming languages such as Java or Perl.) Even
though these functions were designed for speed and lack strong theoretical guarantees,
they are far slower than MULTILINEAR on desktop processors (AMD and Intel). Only
for ARM processors (Apple A4 and Nvidia Tegra 2) with 32-bit hash values are they
much faster. We suspect that this good result on ARM processors is due to the
multiply-accumulate instruction. Clearly, such a multiply-accumulate operation greatly benefits
simple hashing functions such as Rabin-Karp and SAX.
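For readers unfamiliar with these two baselines, the sketch below shows their usual textbook forms over 32-bit characters. This is an assumption about their general shape, not the authors' benchmarked code: in particular the Rabin-Karp multiplier 37 and the SAX shift amounts (5 and 2) are common choices that the text does not specify. Each iteration of Rabin-Karp is a single multiply followed by an add, which is why a multiply-accumulate instruction suits it well.

```c
#include <stddef.h>
#include <stdint.h>

/* Rabin-Karp style hash: h = B*h + c, with an assumed multiplier B = 37. */
static uint32_t rabin_karp(const uint32_t *s, size_t n) {
    uint32_t h = 0;
    for (size_t i = 0; i < n; i++)
        h = 37u * h + s[i];
    return h;
}

/* SAX (shift-add-xor) hash with the commonly used shifts 5 and 2 (assumed). */
static uint32_t sax(const uint32_t *s, size_t n) {
    uint32_t h = 0;
    for (size_t i = 0; i < n; i++)
        h ^= (h << 5) + (h >> 2) + s[i];
    return h;
}
```

Neither function uses any randomness, so an adversary who knows the function can construct colliding strings ahead of time.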
Table 3: A comparison of estimated CPU cycles per byte between fast Multilinear
hashing and common hash functions

                       Rabin-Karp   SAX    best MULTILINEAR
32-bit hash values and characters on 64-bit processors
Intel Core 2 Duo       1.3          1.3    0.52
Intel Xeon X5260       1.4          1.6    0.50
Intel Core i7-860      1.4          1.6    0.42
Intel Core i7-2600     0.89         1.1    0.27
Intel Core i7-2677M    0.64         0.82   0.20
AMD Sempron 3500+      1.0          1.5    0.40
AMD V120               1.0          1.5    0.40
32-bit hash values and characters on 32-bit processors
Intel Atom N270        1.1          2.0    4.2
Apple A4               0.88         1.2    2.7
Nvidia Tegra 2         0.85         1.2    3.0
VIA Nehemiah           2.0          3.0    8.2
16-bit hash values and characters on 32-bit processors
Intel Atom N270        2.1          4.1    2.2
Apple A4               1.8          2.1    1.8
Nvidia Tegra 2         1.6          2.4    1.7
VIA Nehemiah           5.0          6.6    3.6

Table 4: A comparison of estimated CPU cycles per byte between fast Multilinear
hashing and the almost universal hash function NH from Black et al. [17] for 32-bit
hash values using 64-bit arithmetic. When running NH tests, we remove the -mno-sse2
flag where it is present for better results.

                       NH [17]      best MULTILINEAR
Intel Core 2 Duo       0.53         0.52
Intel Xeon X5260       0.50         0.50
Intel Core i7-860      0.42         0.42
Intel Core i7-2600     0.1? (a)     0.27
Intel Core i7-2677M    0.1? (a)     0.20
AMD Sempron 3500+      0.38         0.40
AMD V120               0.38         0.40

(a) We use the -march=corei7-avx flag for best results.

Crosby and Wallach [9] showed that almost universal hashing could be as fast as
common deterministic hash functions. One of their most competitive almost universal
schemes is due to Black et al. [17]. Their fast family of hash functions is called NH:

$$h(s) = \sum_{i=1}^{n/2} \left((m_{2i-1} + s_{2i-1}) \bmod 2^{L/2}\right)\left((m_{2i} + s_{2i}) \bmod 2^{L/2}\right) \bmod 2^{L}.$$

NH is almost universal over fixed-length strings, or over variable-length strings that
do not end with the zero character; we can apply it to strings having odd length by
appending a character with value zero. It fails to be uniform: the value

$$\left((m_1 + s_1) \bmod 2^{L/2}\right)\left((m_2 + s_2) \bmod 2^{L/2}\right)$$

is zero whenever either $(m_1 + s_1) \bmod 2^{L/2}$ or $(m_2 + s_2) \bmod 2^{L/2}$ is zero, which
occurs with probability $\frac{2^{L/2+1} - 1}{2^{L}} > \frac{1}{2^{L}}$ over all possible values of $m_1, m_2$. Moreover,
the least significant bits may fail to be almost universal: e.g., for L = 6, there are
96 pairs of distinct strings colliding with probability 1 over the two least significant
bits. When processing 32-bit characters, it generates 64-bit hash values with collision
probability of $1/2^{32}$. Hence, in our tests over 32-bit characters, NH generates 64-bit
hash values whereas the MULTILINEAR families generate 32-bit hash values, but both have
a collision probability bounded by $1/2^{32}$. Thus, while NH saves memory because it
uses nearly half the number of random bits compared to our fast MULTILINEAR families,
MULTILINEAR families may save memory in a system that stores hash values because their
hash values have half the number of bits. Table 4 shows that the 64-bit NH on 64-bit
processors runs at about the same speed as the best MULTILINEAR on most processors.
Only on some Intel Core i7 processors (2600 and 2677M) is NH's running time 60%
of that of MULTILINEAR when we enable SSE support. In other words, sacrificing theoretical
guarantees does not always translate into better speed.
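The NH formula with L = 64 and 32-bit characters can be sketched as follows, assuming an even number of characters n. The function name `nh` and the argument layout are hypothetical. The code relies on unsigned wraparound for the two reductions: 32-bit addition wraps modulo $2^{32}$ and the 64-bit accumulator wraps modulo $2^{64}$.

```c
#include <stddef.h>
#include <stdint.h>

/* NH over 32-bit characters with 64-bit hash values (L = 64); n must be even. */
static uint64_t nh(const uint32_t *m, const uint32_t *s, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i + 1 < n; i += 2) {
        uint32_t a = m[i] + s[i];          /* addition modulo 2^32 */
        uint32_t b = m[i + 1] + s[i + 1];  /* addition modulo 2^32 */
        sum += (uint64_t)a * b;            /* accumulation modulo 2^64 */
    }
    return sum;
}
```

The lack of uniformity is easy to observe: when the first sum wraps to zero, as with a random value 0xFFFFFFFF and a character 1, the product is zero whatever the second character is, so all such two-character strings collide.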

Overall, these numbers indicate that strongly universal string hashing is computationally
inexpensive on most Intel and AMD processors. To gain good results with the
various 64-bit processors, we recommend MULTILINEAR-HM. Unfortunately, over
long strings, strongly universal hashing requires many random numbers. Generating
and storing these random numbers is the main difficulty. Whether this is a problem
depends on the memory available, the CPU cache, the application workload and the
length of the strings. (Intel researchers reported the generation of true random numbers
in hardware at high speed (4 Gbps) [39].) In practice, unexpectedly long strings
may require the generation of new random numbers while hashing a given string [9].
This overhead should be relatively inexpensive if we know the length of each string
before we process it.

6. Conclusion 

Over moderately long 32-bit strings (≈ 1024 characters), current desktop processors
can achieve strongly universal hashing with no more than 0.5 CPU cycle per byte, and
sometimes as little as 0.2 CPU cycle per byte. Meanwhile, at least twice as many cycles
are required for Rabin-Karp hashing even though it is not even universal.

While it uses half the number of multiplications, we have found that MULTILINEAR-HM
is often no faster than MULTILINEAR on Intel processors. Clearly, Intel's pipelining
architecture has some benefits.

For AMD processors, MULTILINEAR-HM was faster (≈ 33%), as expected because
it uses fewer multiplications. Yet another alternative, MULTILINEAR (2-by-2),
was slightly faster (≈ 15%) for 32-bit hashing on the mobile ARM-based processors
even though it requires twice as many multiplications as MULTILINEAR-HM. These
mobile ARM-based processors also computed 32-bit Rabin-Karp hashing with fewer
cycles per byte than many desktop processors. We believe that this is related to the
presence of a multiply-accumulate instruction in the ARM instruction set.

Appendix A. Implementations in C

We implemented the following hash functions:

• MULTILINEAR: $h(s) = m_1 + \sum_{i=1}^{n} m_{i+1} s_i$

• MULTILINEAR (2-by-2): $h(s) = m_1 + \sum_{i=1}^{n/2} m_{2i} s_{2i-1} + s_{2i} m_{2i+1}$

• MULTILINEAR-HM: $h(s) = m_1 + \sum_{i=1}^{n/2} (m_{2i} + s_{2i-1})(s_{2i} + m_{2i+1})$

For simplicity we assume that the number of characters (n) is even. Following a common
convention, we write the unsigned 32-bit and 64-bit integer data types as uint32
and uint64. The variable p is a pointer to the initial value of the string whereas endp
is a pointer to the location right after the last 32-bit character of the string. The variable
m is a pointer to the 64-bit random numbers. (When using 63-bit random numbers as
allowed by Theorem 1, the right shifts should be by 31 instead. In practice, we use 64-bit
numbers.) On some compilers and processors, it was useful to disable SSE2: under
GNU GCC we can achieve this result with function attributes (e.g., by preceding the
function declaration by __attribute__ ((__target__ ("no-sse2")))).

Multilinear.

    uint32 hash(uint64 * m, uint32 * p, uint32 * endp) {
        uint64 sum = *(m++);
        for (; p != endp; ++m, ++p) {
            sum += *m * *p;
        }
        return sum >> 32;
    }

Multilinear (2-by-2).

    uint32 hash(uint64 * m, uint32 * p, uint32 * endp) {
        uint64 sum = *(m++);
        for (; p != endp; m += 2, p += 2) {
            sum += (*m * *p) + (*(m + 1) * *(p + 1));
        }
        return sum >> 32;
    }

MULTILINEAR-HM.

    uint32 hash(uint64 * m, uint32 * p, uint32 * endp) {
        uint64 sum = *(m++);
        for (; p != endp; m += 2, p += 2) {
            sum += (*m + *p) * (*(m + 1) + *(p + 1));
        }
        return sum >> 32;
    }
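The three listings above share the name hash and rely on the uint32 and uint64 shorthands, so they cannot be compiled together as written. The following sketch restates them with the standard `<stdint.h>` types and distinct, hypothetical function names, which makes it possible to check that MULTILINEAR and MULTILINEAR (2-by-2) return the same value, as expected since the latter is only an unrolled form of the former.

```c
#include <stdint.h>

/* MULTILINEAR: one multiplication per character. */
static uint32_t multilinear(const uint64_t *m, const uint32_t *p, const uint32_t *endp) {
    uint64_t sum = *(m++);
    for (; p != endp; ++m, ++p)
        sum += *m * *p;
    return (uint32_t)(sum >> 32);
}

/* MULTILINEAR (2-by-2): same sum, two characters per iteration. */
static uint32_t multilinear_2by2(const uint64_t *m, const uint32_t *p, const uint32_t *endp) {
    uint64_t sum = *(m++);
    for (; p != endp; m += 2, p += 2)
        sum += (*m * *p) + (*(m + 1) * *(p + 1));
    return (uint32_t)(sum >> 32);
}

/* MULTILINEAR-HM: one multiplication for every two characters. */
static uint32_t multilinear_hm(const uint64_t *m, const uint32_t *p, const uint32_t *endp) {
    uint64_t sum = *(m++);
    for (; p != endp; m += 2, p += 2)
        sum += (*m + *p) * (*(m + 1) + *(p + 1));
    return (uint32_t)(sum >> 32);
}
```

MULTILINEAR-HM is a different hash function from the other two, so in general it returns a different value for the same random numbers; only its strong universality is shared.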



Appendix B. Implementations with carry-less multiplications 

We implemented MULTILINEAR in GF(2^32) in C using the Carry-less Multiplication
(CLMUL) instruction set [26] supported by recent Intel and AMD processors. We also
implemented the counterpart to MULTILINEAR-HM which executes half the number
of multiplications.

We use the same conventions as in Appendix A regarding the variables p and m
except that the latter is a pointer to 32-bit random numbers. We wrote our C programs
using SSE intrinsics: they are functions supported by several major compilers (including
GNU GCC, Intel and Microsoft) that generate SIMD instructions.

The Barrett reduction algorithm is adapted from Knezevic et al. [28]. The variable
C contains the chosen irreducible polynomial. We initialize it as

    C = _mm_set_epi64x(0, 1UL + (1UL << 2) + (1UL << 6)
                          + (1UL << 7) + (1UL << 32));

Barrett reduction.

    /* n is the shift amount: the degree of the polynomial in C (32 here) */
    uint32 barrett(__m128i A) {
        __m128i Q1 = _mm_srli_epi64(A, n);
        __m128i Q2 = _mm_clmulepi64_si128(Q1, C, 0x00);
        __m128i Q3 = _mm_srli_epi64(Q2, n);
        __m128i f = _mm_xor_si128(A,
                        _mm_clmulepi64_si128(Q3, C, 0x00));
        return _mm_cvtsi128_si64(f);
    }

GF Multilinear.

    uint32 hash(uint32 * m, uint32 * p, uint32 * endp) {
        __m128i sum = _mm_set_epi64x(0, *(m++));
        for (; p != endp; ++m, ++p) {
            __m128i t = _mm_set_epi64x(*m, *p);
            /* 0x10: multiply the low half of t by its high half */
            __m128i c = _mm_clmulepi64_si128(t, t, 0x10);
            sum = _mm_xor_si128(c, sum);
        }
        return barrett(sum);
    }

GF MULTILINEAR-HM.

    uint32 hash(uint32 * m, uint32 * p, uint32 * endp) {
        __m128i sum = _mm_set_epi64x(0, *(m++));
        for (; p != endp; m += 2, p += 2) {
            __m128i t1 = _mm_set_epi64x(*m, *(m + 1));
            __m128i t2 = _mm_set_epi64x(*p, *(p + 1));
            __m128i t = _mm_xor_si128(t1, t2);
            __m128i c = _mm_clmulepi64_si128(t, t, 0x10);
            sum = _mm_xor_si128(c, sum);
        }
        return barrett(sum);
    }






